Susan Li. (2019, Apr 1). Building a Content Based Recommender System for Hotels in Seattle. Retrieved June 27, 2019, from https://towardsdatascience.com/building-a-content-based-recommender-system-for-hotels-in-seattle-d724f0a32070
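Recommendation systems have a well-known weakness, the Cold Start Problem: the system cannot push relevant items to users effectively. For new users, new products, or a new site or platform, there is not enough data, so it is hard to build a usable model for making recommendations.
In this article, the author Susan Li uses content-based filtering to address the Cold Start Problem. A Content-Based Recommendation System applies to many different domains, and because it relies on the description of each item rather than on user interaction history, it avoids the cold start problem and can make useful recommendations as soon as a product or site goes live.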
Scenario
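The author sets up the following scenario: we are a new Online Travel Agency (OTA, similar to ezTravel in Taiwan or Hotels.com), and thousands of hotels sell rooms on our platform. Because the platform is new, we have very little user data. We therefore want to build a Content-Based Recommendation System that uses each hotel's own description to judge whether it matches what a user is looking for, and recommends hotels on that basis.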
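To make the cold start argument concrete, here is a minimal sketch using only the standard library. It ranks existing hotels against a newly listed one using Jaccard word overlap between descriptions. Jaccard overlap is a simpler stand-in chosen for illustration, not the method used in the article, and the hotel names and descriptions are made up.

```python
def jaccard(a, b):
    """Share of distinct words two descriptions have in common (0.0 to 1.0)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

# Hypothetical catalog: hotel name -> description text
catalog = {
    'Pier View Inn': 'waterfront inn with harbor view',
    'Airport Lodge': 'budget lodge near airport with shuttle',
    'City Center Suites': 'downtown suites close to shopping',
}

# A hotel listed today: no clicks, bookings, or ratings exist for it yet
new_desc = 'waterfront hotel with harbor view and spa'

# Rank existing hotels by description overlap with the new listing
ranked = sorted(catalog, key=lambda name: jaccard(catalog[name], new_desc), reverse=True)
print(ranked)
```

The new hotel can be matched to similar listings the moment its description exists, which is the property the scenario depends on.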
import re
import random

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import plotly.graph_objs as go
import plotly.plotly as py  # plotly < 4 only; in plotly >= 4 this module moved to the chart_studio package
import plotly.figure_factory as ff
from plotly.offline import iplot
import cufflinks
from IPython.core.interactiveshell import InteractiveShell

nltk.download('stopwords')
pd.options.display.max_columns = 30
InteractiveShell.ast_node_interactivity = 'all'
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='solar')
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()
print('We have ', len(df), 'hotels in the data')
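# word_count and desc_lengths are not defined in this excerpt; the two lines
# below are an assumed definition (number of whitespace-separated words per description)
df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
desc_lengths = list(df['word_count'])

print("Number of descriptions:", len(desc_lengths),
      "\nAverage word count", np.average(desc_lengths),
      "\nMinimum word count", min(desc_lengths),
      "\nMaximum word count", max(desc_lengths))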
df['word_count'].iplot(
    kind='hist',
    bins=50,
    linecolor='black',
    xTitle='word count',
    yTitle='count',
    title='Word Count Distribution in Hotel Description')
# The three constants below are not defined in this excerpt; these patterns are assumed
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string

    return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace characters matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete characters matched by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords from text
    return text

df['desc_clean'] = df['desc'].apply(clean_text)
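To see what this cleaning step does to a single description, here is a small standalone sketch. The two regular expressions and the short stopword set are assumptions made for illustration (a hand-written set replaces the NLTK English stopword list so the example runs without a download), and the sample sentence is invented.

```python
import re

# Assumed patterns: punctuation that becomes a space, and characters that are dropped
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Small illustrative stopword set, not the full NLTK list
STOPWORDS = {'the', 'a', 'an', 'and', 'is', 'in', 'of', 'to', 'from', 'with'}

def clean_text(text):
    text = text.lower()                        # lowercase
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # remaining symbols are removed
    return ' '.join(w for w in text.split() if w not in STOPWORDS)

sample = "Located in the heart of Seattle, the hotel is 5 minutes' walk from Pike Place Market (free Wi-Fi)!"
cleaned = clean_text(sample)
print(cleaned)
# located heart seattle hotel 5 minutes walk pike place market free wifi
```

Note that replacing separators with a space before deleting other symbols keeps words such as "market" and "free" apart, while the hyphen in "Wi-Fi" is simply removed.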
# Assumes df is indexed by hotel name, indices is a Series of those names, and
# cosine_similarities is the pairwise similarity matrix; none are defined in this excerpt
def recommendations(name, cosine_similarities=cosine_similarities):
    recommended_hotels = []

    # get the index of the hotel that matches the given name
    idx = indices[indices == name].index[0]

    # use that index to pull this hotel's row from the cosine similarity matrix
    # and sort its similarity scores in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)

    # take the indexes of the 10 most similar hotels, skipping the hotel itself
    top_10_indexes = list(score_series.iloc[1:11].index)

    # populate the list with the names of the top 10 matching hotels
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])

    return recommended_hotels
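The function above relies on a similarity matrix and a name lookup that this excerpt does not show. The sketch below is one way those pieces might fit together on a toy dataset: descriptions are turned into TF-IDF vectors with scikit-learn, and `linear_kernel` gives their pairwise dot products, which equal cosine similarities because `TfidfVectorizer` L2-normalizes rows by default. The hotel names, descriptions, the `recommend` helper, and the vectorizer settings are all illustrative assumptions, not the article's data or parameters.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy data standing in for the cleaned hotel descriptions
toy = pd.DataFrame({
    'name': ['Airport Inn', 'Runway Suites', 'Downtown Loft', 'Market Hotel'],
    'desc_clean': [
        'free airport shuttle near seatac airport',
        'airport shuttle close seatac terminal',
        'downtown loft near pike place market',
        'boutique hotel pike place market downtown',
    ],
}).set_index('name')

# Rows are L2-normalized TF-IDF vectors, so dot products are cosine similarities
tfidf_matrix = TfidfVectorizer().fit_transform(toy['desc_clean'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Positional lookup: row number -> hotel name
indices = pd.Series(toy.index)

def recommend(name, top_n=2):
    """Return the top_n hotels most similar to `name`, excluding itself."""
    idx = indices[indices == name].index[0]
    scores = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    top = [i for i in scores.index if i != idx][:top_n]
    return [toy.index[i] for i in top]

print(recommend('Airport Inn'))
```

This sketch filters out the query hotel by index instead of slicing off the first sorted row, which stays correct even if another hotel ties with it at a similarity of 1.0.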
Recommendations
Try entering "Hilton Seattle Airport & Conference Center" to see the recommendation list.